Analyzing The Data

Before starting a data analysis, it’s a good idea to brain dump a series of questions you hope to answer based on the variables you know you’re working with. For this toolkit, we’re going to answer the following:

This is not an exhaustive list of the questions you could answer with this data, but it’s a good introduction to sorting through large data sets. Let’s take a look at these questions one at a time.

1. Setup again

As with your cleaning file, you need to import the libraries that allow you to run certain functions. We’ll be loading the same ones. Copy the code below and run it.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(lubridate)
library(stringr)

2. Importing the clean data

Copy and run the code below to import the table we saved in the previous step. To view the table again, type “final_table” on a new line in the code chunk and run it again. There should be 14,190 rows of data with nine columns: people_id, name, party, role, district, bill_id, bill_number, date_filed, and description. Right now, you can see the data is listed in alphabetical order by lawmaker last name, which is why Rep. Alma Allen comes up for the first 50 rows. In the next few steps, we’ll use different functions to rearrange the table and look at the data in new ways!

final_table <- read_rds("data-processed/bills-89.rds")

final_table

3. Lawmakers: most vs. least filed

Now let’s get into our first question: which lawmaker filed the most bills this session? To figure this out, we’re going to use three functions — group_by, summarize and arrange.

  • “group_by” allows us to sort the data into specific groups based on variables that we want to focus on; in this case, we want to group by name, party, role, and district.

  • “summarize” computes a summary of each group, such as a mean, a sum or a count; we’ll use a count, with n(), because we want the number of bills each lawmaker has filed. (dplyr accepts both spellings, summarize and summarise; the code below uses summarise.)

  • “arrange” allows us to determine the order in which the rows are presented; we can use this function to order the lawmaker names from most filed to least filed.
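To see how the three functions fit together before running them on the real data, here is a minimal sketch on a made-up table. The tibble toy_bills and the lawmaker names in it are invented for illustration and are not part of the toolkit data.

```r
library(dplyr)
library(tibble)

# Hypothetical mini data set: one row per bill, like final_table but tiny
toy_bills <- tibble(
  name = c("Lawmaker A", "Lawmaker B", "Lawmaker A",
           "Lawmaker C", "Lawmaker A", "Lawmaker B"),
  bill_number = c("HB1", "HB2", "HB3", "SB1", "HB4", "HB5")
)

toy_counts <- toy_bills |>
  group_by(name) |>                  # one group per lawmaker
  summarise(number_filed = n()) |>   # n() counts the rows in each group
  arrange(desc(number_filed))        # largest count first

toy_counts
```

Each lawmaker ends up on one row with a count of their bills, which is the same shape of result we want from the full table.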

It’s very common to use these three functions together to run basic computations of data. So let’s apply it to our data set! Run the code below to find out who filed the most bills.

3.1 Doing the most first

final_table |> 
  group_by(name, party, role, district) |> 
  summarise(number_filed = n()) |> 
  arrange(desc(number_filed))

`summarise()` has grouped output by 'name', 'party', 'role'. You can override
using the `.groups` argument.

Tip

If you try to run the code and you get an error, check the syntax. Does every line of the pipeline except the last end with the pipe, |>? Did you accidentally add in a comma or a parenthesis? Code syntax is extremely important, and most code errors come from very small typing mistakes.

Looks like Sen. Bryan Hughes filed the most this session, with 276 bills. If you click through the table, you’ll see the values in the “number_filed” column decrease. That’s because we added the “desc()” element to our arrange function, which means we want it arranged in descending order.
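The message printed under the code, the one that starts with `summarise()` has grouped output by, is not an error. After summarising, dplyr keeps all but the last grouping variable by default and prints the message to let you know. Here is a minimal sketch, on invented data, of how the `.groups` argument named in that message can drop the grouping and silence the note. The two grouping columns stand in for the four used above.

```r
library(dplyr)
library(tibble)

# Hypothetical data: name and party stand in for the four grouping columns
toy_bills <- tibble(
  name  = c("Lawmaker A", "Lawmaker A", "Lawmaker B"),
  party = c("R", "R", "D")
)

toy_counts <- toy_bills |>
  group_by(name, party) |>
  summarise(number_filed = n(), .groups = "drop") |>  # ungrouped result, no message
  arrange(desc(number_filed))

# Check whether the result still carries any grouping
is_grouped_df(toy_counts)
```

Leaving the argument out, as the toolkit code does, gives the same counts; the only difference is the message and the leftover grouping.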

3.2 Just the top of the list

We’re looking at a lot of rows here, 180 to be exact, so what if we wanted to easily focus on the top 10 lawmakers who have filed the most this session? We can do this with the head() function. Run the code again, this time with one extra line at the end.

most_filed <- final_table |> 
  group_by(name, party, role, district) |> 
  summarise(number_filed = n()) |> 
  arrange(desc(number_filed)) |> 
  head(10)

`summarise()` has grouped output by 'name', 'party', 'role'. You can override
using the `.groups` argument.

most_filed

Because we put the number 10 inside the head function, the table now only shows the first 10 rows – or top 10 lawmakers based on number of bills filed. Feel free to mess around with the number inside the head function and see how your result changes.
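One thing to keep in mind is that head() simply cuts the table off after the requested number of rows, so a lawmaker tied with the last row shown would be left out. As an alternative sketch, dplyr's slice_max() keeps ties by default. The data and the cutoff of 2 below are invented for illustration. count() is used here because it returns an ungrouped table, and slice_max() works group by group when a table is still grouped.

```r
library(dplyr)
library(tibble)

# Hypothetical data: one row per bill, identified only by its author
toy_bills <- tibble(
  lawmaker = c("A", "A", "A", "B", "B", "C", "C", "D")
)

# count() returns an ungrouped table: A = 3, B = 2, C = 2, D = 1
toy_counts <- toy_bills |>
  count(lawmaker, name = "number_filed")

# head() keeps exactly 2 rows, so one of the tied lawmakers is cut
top_with_head <- toy_counts |>
  arrange(desc(number_filed)) |>
  head(2)

# slice_max() keeps every row tied with the cutoff (with_ties = TRUE by default)
top_with_ties <- toy_counts |>
  slice_max(number_filed, n = 2)

nrow(top_with_head)
nrow(top_with_ties)
```

Which behavior you want depends on the story: head() guarantees a fixed number of rows, while slice_max() guarantees nobody is dropped because of a tie.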

3.3 Now the least

We can easily see who filed the fewest bills with one small change to the code above. The arrange() function defaults to ascending order, from least filed to most filed, when desc() is not in the code. So all we need to do is take that out and run it!

final_table |> 
  group_by(name, party, role, district) |> 
  summarise(number_filed = n()) |> 
  arrange(number_filed)

`summarise()` has grouped output by 'name', 'party', 'role'. You can override
using the `.groups` argument.

Rep. Ramon Romero Jr. filed the fewest this session, with only 10 bills. Like before, let’s focus on just the first 10 rows, which in this case are the 10 lawmakers who filed the fewest bills this session. Copy and run the code below.

final_table |> 
  group_by(name, party, role, district) |> 
  summarise(number_filed = n()) |> 
  arrange(number_filed) |> 
  head(10)

`summarise()` has grouped output by 'name', 'party', 'role'. You can override
using the `.groups` argument.

Congrats! You just answered the first question from the brainstorm. And hopefully you’re starting to see how you can organize a data table and look at it in different ways.

4. Districts

Now that we know who filed the most bills this session, we want to figure out which districts they represent. Our data set has a Texas House or Senate district (HD/SD) associated with each lawmaker, so we can use a similar method as above to quickly see where the most “active” lawmakers come from.

final_table |> 
  group_by(name, district) |> 
  count(district) |> 
  arrange(desc(n))

5. Bills per month

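# Count bills by month filed; assumes date_filed (see step 2) is a Date column
final_table |> 
  mutate(month_filed = floor_date(date_filed, unit = "month")) |> 
  count(month_filed, name = "number_filed")

Note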

November and March are both incomplete months in the data set because filing opened on Nov. 12 and closed on March 14. Keep this in mind when analyzing your results in this section.
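Here is a minimal sketch of a monthly count on invented dates, assuming the filing date is stored as a Date column (called date_filed here, as in the column list from step 2). lubridate's floor_date() rounds each date down to the first of its month so the bills can be counted per month. The partial_month column is a hypothetical helper that flags the two incomplete months named in the note.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical filing dates, not taken from the real data
toy_bills <- tibble(
  date_filed = as.Date(c("2024-11-12", "2024-11-20", "2024-12-05",
                         "2025-01-15", "2025-03-01", "2025-03-14"))
)

monthly <- toy_bills |>
  mutate(month_filed = floor_date(date_filed, unit = "month")) |>  # e.g. 2024-11-20 becomes 2024-11-01
  count(month_filed, name = "number_filed") |>
  mutate(partial_month = month(month_filed) %in% c(11, 3))         # November and March are incomplete

monthly
```

Keeping the flag next to the counts makes it harder to forget that a low November or March total may only reflect a shorter filing window.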

6. Policy topics // Status tracking

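Here we organize bill status by type of legislation (HB, SB, HR, SR, HJR, SJR, HCR, SCR). Each block below filters final_table to one type using its bill_number prefix, counts the bills in each status, and labels the result with its chamber and bill type. Note that the code relies on a status_desc column, which is not among the nine columns listed in step 2, so your table needs to include it.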

sb_status <- final_table |> 
  filter(grepl("SB", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "Sen", bill_type = "Bill") 

hb_status <- final_table |> 
  filter(grepl("HB", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "House", bill_type = "Bill")

sr_status <- final_table |> 
  filter(grepl("SR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "Sen", bill_type = "Resolution")

hr_status <- final_table |> 
  filter(grepl("HR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "House", bill_type = "Resolution") 

hjr_status <- final_table |> 
  filter(grepl("HJR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "House", bill_type = "Joint Resolution")

sjr_status <- final_table |> 
  filter(grepl("SJR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "Sen", bill_type = "Joint Resolution")

hcr_status <- final_table |> 
  filter(grepl("HCR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "House", bill_type = "Concurrent Resolution")

scr_status <- final_table |> 
  filter(grepl("SCR", bill_number, ignore.case = TRUE)) |> 
  count(status_desc) |> 
  mutate(chamber = "Sen", bill_type = "Concurrent Resolution")
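
The eight blocks above repeat the same pattern, so as an alternative sketch the bill type can be pulled out of bill_number once and counted in a single pipeline. This assumes every bill number starts with its letter prefix (for example "HB12" or "SJR5"). The toy rows and status labels are invented, and the case_when() lookup is just one possible way to label chamber and type.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical rows; status labels are made up for illustration
toy_bills <- tibble(
  bill_number = c("HB1", "HB2", "SB1", "HJR3", "SCR4", "HR5"),
  status_desc = c("Introduced", "Passed", "Introduced",
                  "Introduced", "Passed", "Introduced")
)

status_by_type <- toy_bills |>
  mutate(
    # Leading letters of the bill number, upper-cased, e.g. "HJR"
    prefix = str_to_upper(str_extract(bill_number, "^[A-Za-z]+")),
    chamber = if_else(str_starts(prefix, "H"), "House", "Sen"),
    # Check the longer suffixes first so "HJR" is not labeled a plain resolution
    bill_type = case_when(
      str_ends(prefix, "JR") ~ "Joint Resolution",
      str_ends(prefix, "CR") ~ "Concurrent Resolution",
      str_ends(prefix, "R")  ~ "Resolution",
      TRUE                   ~ "Bill"
    )
  ) |>
  count(chamber, bill_type, status_desc)

status_by_type
```

The result is one table with a row for each combination of chamber, bill type and status, rather than eight separate tables.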

final_table |> 
  filter(grepl("abortion", description, ignore.case = TRUE))

final_table |> 
  filter(grepl("immigration", description, ignore.case = TRUE))

final_table |> 
  filter(grepl("diversity", description, ignore.case = TRUE))

final_table |> 
  filter(grepl("energy", description, ignore.case = TRUE))

final_table |> 
  filter(grepl("cannabis|marijuana", description, ignore.case = TRUE))

final_table |> 
  filter(grepl("prayer", description, ignore.case = TRUE))

Reminder

The data we’re using for this section comes from the bill description only, not the full bill text. This is important to keep in mind as we search for key words.
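To compare topics side by side instead of running one search at a time, the searches can be looped over a named set of patterns. This is a minimal sketch, assuming the same case-insensitive matching on description used above. The toy descriptions and the topic_patterns vector are invented for illustration.

```r
library(tibble)
library(stringr)

# Hypothetical bill descriptions, not taken from the real data
toy_bills <- tibble(
  description = c(
    "Relating to the regulation of medical cannabis.",
    "Relating to energy efficiency standards for state buildings.",
    "Relating to penalties for possession of marijuana.",
    "Relating to a period of prayer in public schools."
  )
)

# Topic label on the left, search pattern on the right
topic_patterns <- c(
  energy   = "energy",
  cannabis = "cannabis|marijuana",
  prayer   = "prayer"
)

topic_counts <- tibble(
  topic = names(topic_patterns),
  number_of_bills = vapply(
    topic_patterns,
    # Count the descriptions that match this pattern, ignoring case
    function(pattern) {
      sum(str_detect(toy_bills$description, regex(pattern, ignore_case = TRUE)))
    },
    integer(1),
    USE.NAMES = FALSE
  )
)

topic_counts
```

As with the individual searches, these counts only reflect words that appear in the description, so a bill can touch a topic without being counted.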

7. This session vs. last session

8. Saving tables for charts

write_csv(most_filed, "data-processed/most-filed.csv")

9. Recap of what we learned